Deep learning can yield clinically useful right ventricular segmentations faster than fully manual analysis

Right ventricular (RV) volumes are commonly obtained through time-consuming manual delineations of cardiac magnetic resonance (CMR) images. Deep learning-based methods can generate RV delineations, but few studies have assessed their ability to accelerate clinical practice. Therefore, we aimed to develop a clinical pipeline for deep learning-based RV delineations and validate its ability to reduce the manual delineation time. Quality-controlled delineations in short-axis CMR scans from 1114 subjects were used for development. Time reduction was assessed by two observers using 50 additional clinical scans. Automated delineations were subjectively rated as (A) sufficient for clinical use, or as needing (B) minor or (C) major corrections. Times were measured for manual corrections of delineations rated as B or C, and for fully manual delineations on all 50 scans. Of the automated delineations, 58% were rated as A, 42% as B, and none as C. The average time was 6 min for a fully manual delineation, 2 s for an automated delineation, and 2 min for a minor correction, yielding a time reduction of 87%. The deep learning-based pipeline could substantially reduce the time needed to manually obtain clinically applicable delineations, indicating that it can yield right ventricular assessments faster than fully manual analysis in clinical practice. However, these results may not generalize to clinics using other RV delineation guidelines.

For the main dataset, all delineations were performed by experienced physicians or MD-PhD students. In the latter case, the quality and agreement with consensus guidelines 22 of the delineations were verified by experienced physicians.
Before inclusion, all delineations were subject to a quality control performed by one of the observers, to guarantee adequate concordance with the consensus guidelines 22 and hence suitability for inclusion. Subjects with complex congenital heart defects (e.g. situs inversus, transposition of the great arteries, and univentricular hearts) were excluded. Subjects that only had RV delineations in non-short-axis images were excluded. The quality control reduced the initial 2490 examinations to 1693 examinations, corresponding to 1434 unique subjects. A single subject could have had several examinations.
The main dataset (n = 1434) was partitioned into a training set (n = 1114) and a test set (n = 320). Some examinations had delineations that were approved only in one of the two timeframes (ED or ES) during the quality control. From these examinations, the timeframes with approved delineations were used for the training set. For the test set, only examinations with delineations approved during the quality control in both ED and ES were included. Due to this, 104 subjects were excluded, leaving 216 subjects in the refined test set (TS).
To assess generalizability, the trained pipeline was applied to the Automatic Cardiac Diagnosis Challenge (ACDC) testing dataset and externally validated using the online challenge evaluation platform 23 . This dataset consists of short-axis CMR images of 10 healthy and 40 pathological subjects 23 . To further validate the segmentation precision of the pipeline, a scan-rescan assessment was performed using CMR images of 10 additional healthy subjects from two different scanning occasions. For 3 of the subjects, the rescans were performed using different scanners.

Pipeline.
A flowchart of the pipeline developed for this study can be seen in Fig. 2. The required input to the pipeline was a stack of short-axis images in one timeframe and its pixel spacing (the distance between pixels, in mm). The input timeframe was pre-processed by slice-wise re-sampling to the median pixel spacing of the training set (1.07 × 1.07 mm) through bilinear interpolation. After pre-processing, three subsequent CNNs were applied to the timeframe. The purpose of the two first CNNs was to handle the size variability of clinical CMR images and robustly pre-process the input data, making the third CNN (the segmentation CNN) implementable in a clinical setting. Details on each used CNN architecture are provided in Supplementary Methods S2.
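The slice-wise re-sampling step can be illustrated with a minimal sketch. It assumes corner-aligned bilinear interpolation and an output size chosen to preserve the physical field of view; the function name `resample_slice` and these details are illustrative assumptions, not the implementation used in the pipeline.

```python
import numpy as np

TARGET_SPACING_MM = (1.07, 1.07)  # median training-set pixel spacing stated in the text


def resample_slice(slice_2d, spacing_mm, target_mm=TARGET_SPACING_MM):
    """Bilinearly resample one 2D slice from spacing_mm to target_mm (row, col)."""
    img = np.asarray(slice_2d, dtype=np.float64)
    rows, cols = img.shape
    # Output size keeps the physical field of view approximately constant.
    out_rows = max(1, int(round(rows * spacing_mm[0] / target_mm[0])))
    out_cols = max(1, int(round(cols * spacing_mm[1] / target_mm[1])))
    # Output pixel positions expressed in input pixel coordinates (corner-aligned).
    r = np.linspace(0, rows - 1, out_rows)
    c = np.linspace(0, cols - 1, out_cols)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, rows - 1)
    c1 = np.minimum(c0 + 1, cols - 1)
    wr = (r - r0)[:, None]  # fractional row weights
    wc = (c - c0)[None, :]  # fractional column weights
    # Interpolate along columns first, then along rows.
    top = img[r0][:, c0] * (1 - wc) + img[r0][:, c1] * wc
    bottom = img[r1][:, c0] * (1 - wc) + img[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr
```

A slice acquired at 2.14 × 2.14 mm would thus be upsampled by a factor of two in each direction. Library routines such as `scipy.ndimage.zoom` with `order=1` provide comparable linear interpolation, although their boundary handling may differ from this sketch.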
The first CNN (the slice selection network) selected each slice of the input timeframe that contained a cross-section of the RV, through binary classification. The first and last selected slices were used to mark the boundaries of the RV in the slice direction. This was done in concordance with the method used by Berggren et al. 24 .
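The bounding logic can be sketched as follows, assuming the slice selection network's output has already been reduced to one boolean per slice. The function name `rv_slice_bounds` is hypothetical.

```python
def rv_slice_bounds(contains_rv):
    """Return (first, last) indices of slices flagged as containing the RV.

    `contains_rv` is a sequence of booleans, one per slice in stack order.
    Returns None when no slice is flagged.
    """
    selected = [i for i, flag in enumerate(contains_rv) if flag]
    if not selected:
        return None
    return selected[0], selected[-1]
```

Under this reading, a slice lying between the first and last selected slices is passed on even if it was not itself flagged, since only the outer bounds are used.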
Figure 2. A flowchart of the pipeline. An input timeframe is pre-processed and inserted into a slice selection network that selects slices containing RV cross-sections. The RV center point is then detected in each selected slice by the RV center point detection network and used to crop (or pad) each slice around the RV before insertion into the segmentation network. This yields a segmentation of the RV, that is then inversely padded or cropped to match the original slice size.
Each slice within the selected bounds of the RV was inserted into the second CNN (the RV center point detection network) that generated the coordinates of the center point of the RV cross-section in each slice. The RV center points were used to crop (or zero-pad) each slice to 256 × 256 pixels, giving all slices the same field of view and a centered RV. Each such slice was then subject to Z-score intensity normalization and inserted into the third CNN (the segmentation network) that generated a binary mask for each slice, where the cross-section of the RV endocardium was the foreground. Each generated segmentation mask was then inversely padded or cropped back to the original size of the input image, using the RV center point for positioning. As a post-processing step, only the largest connected component in each mask was taken as the RV and other components were removed.
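The cropping, normalization, inverse positioning and post-processing steps can be sketched as below. The function names are hypothetical. Per-slice normalization statistics and 4-connectivity are assumptions, as the text does not specify either.

```python
import numpy as np

SIZE = 256  # side length of the segmentation network input stated in the text


def _window(center, shape, size):
    """Corner of the window and its overlap with an image of `shape`."""
    top, left = int(center[0]) - size // 2, int(center[1]) - size // 2
    r0, c0 = max(top, 0), max(left, 0)
    r1, c1 = min(top + size, shape[0]), min(left + size, shape[1])
    return top, left, r0, c0, r1, c1


def crop_or_pad(img, center, size=SIZE):
    """Cut a size x size window centred on `center` (row, col); zero-pad outside."""
    out = np.zeros((size, size), dtype=img.dtype)
    top, left, r0, c0, r1, c1 = _window(center, img.shape, size)
    if r1 > r0 and c1 > c0:
        out[r0 - top:r1 - top, c0 - left:c1 - left] = img[r0:r1, c0:c1]
    return out


def restore(mask, center, original_shape, size=SIZE):
    """Inverse of crop_or_pad: place the window back into an original-size array."""
    out = np.zeros(original_shape, dtype=mask.dtype)
    top, left, r0, c0, r1, c1 = _window(center, original_shape, size)
    if r1 > r0 and c1 > c0:
        out[r0:r1, c0:c1] = mask[r0 - top:r1 - top, c0 - left:c1 - left]
    return out


def zscore(img, eps=1e-8):
    """Zero-mean, unit-variance intensity normalization of one slice."""
    img = img.astype(np.float64)
    return (img - img.mean()) / (img.std() + eps)


def largest_component(mask):
    """Keep only the largest 4-connected foreground component of a binary mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    best = []
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:  # depth-first flood fill from the unvisited seed pixel
            r, c = stack.pop()
            component.append((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                        and mask[nr, nc] and not seen[nr, nc]):
                    seen[nr, nc] = True
                    stack.append((nr, nc))
        if len(component) > len(best):
            best = component
    out = np.zeros_like(mask)
    for r, c in best:
        out[r, c] = True
    return out
```

Because `restore` reuses the same center point as `crop_or_pad`, the mask returns to the position it occupied in the original slice, and any region that was zero-padded is simply discarded.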
Overview of CNN training and pipeline evaluation. The three CNNs were trained separately using differently pre-processed versions of the training set with different ground truth definitions that matched the respective subtask of the given CNN (see Supplementary Methods S3). For all three CNN types, hyperparameter optimization was carried out using grid search five-fold cross validation on the training set. The optimized sets of hyperparameters were then used to train the final models on the full training set. Training details can be found in Supplementary Methods S3. A performance evaluation of the final pipeline was carried out on the refined test set (TS). Although quality was controlled by a level 3 CMR reader, some of the delineations in TS were performed by non-level 3 readers. Therefore, the TS delineations were not used for validating the clinical benefits of the pipeline, but only for initial pipeline evaluation. The results from this can be found in Supplementary Results S4.
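Grid search with five-fold cross validation can be sketched generically as follows. The hyperparameter names, the `train_and_score` callback and the random fold assignment are hypothetical placeholders; the actual search space and training procedure are described in Supplementary Methods S3.

```python
import itertools
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Shuffle item indices and deal them into k disjoint validation folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search_cv(grid, n_items, train_and_score, k=5):
    """Return (best_params, best_mean_score) over the Cartesian product of `grid`.

    `train_and_score(params, train_idx, val_idx)` is a user-supplied callback
    (hypothetical here) that trains a model and returns a validation score,
    where higher is better.
    """
    folds = k_fold_indices(n_items, k)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = []
        for i, val_idx in enumerate(folds):
            # Train on every fold except the one held out for validation.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(train_and_score(params, train_idx, val_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

The configuration with the highest mean validation score would then be used to train a final model on the full training set, as described above.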
Validation of clinical benefit using the clinical validation set (CVS). The clinical validation set (CVS) was collected from fifty consecutive and unselected short-axis CMR examinations from clinical scans at Skåne University Hospital. Each examination was from a unique subject. The observers were blinded to all clinical and diagnostic information.
Automated delineations in ED and ES for all subjects of the CVS and TS were first generated using the pipeline. To define whether the delineations were of clinically adequate quality, O1 (expert core-lab reader 25,26 ) performed a visual rating of the overall quality of the automated delineations in both ED and ES for each subject according to a three-level modified Likert scale of decreasing delineation quality as A (sufficient for clinical use), B (needing minor corrections) or C (needing major corrections) for both the CVS and TS. Level A delineations were sufficiently close to fulfilling the consensus guidelines to likely not affect diagnosis. Level B and C delineations both deviated from the consensus guidelines enough to likely affect diagnosis but required two different levels of effort for correction. Examples of these ratings can be found in Supplementary Videos S6,S7,S8 and Supplementary Methods S5.
An assessment of the time consumption for obtaining delineations of clinically adequate quality was conducted on the CVS by O1 and O2. To do this, the delineation runtimes for the pipeline on an NVIDIA Quadro T1000 GPU were measured. Then, O1 and O2 performed timed, manual corrections of the automated delineations that had been rated as B or C. In Segment 18 , the generated contours are displayed using spline points, which allows for fine adjustments to be made to specific parts of the delineation. Lastly, the two observers each performed timed, manual delineations without prior automated delineations and according to the consensus guidelines 22 , for all subjects in the CVS. Reference volumes were computed as the mean volumes between observers. All measured times were for delineating both the ED and ES timeframes.

Statistics.
Normality of distributions was assessed through the Anderson-Darling test. Bias was assessed according to Bland-Altman analysis 27 for both volume and ejection fraction (EF), and Spearman's rank correlation coefficients (r) were used for comparative analyses. Relative bias was expressed as percentages of mean reference volumes. Two-sided Wilcoxon signed rank tests were used for the comparison between delineation times. Two-sample F-tests were used for determining the significance of differences in variabilities (variances). A two-sided p value lower than 0.05 was considered statistically significant. Segmentation performances were evaluated using Dice score (Sørensen-Dice coefficient 28
For the CVS, the mean (± SD) volumetric Dice scores between the pipeline and delineations by O1 were 0.91 ± 0.02 for ED and 0.87 ± 0.04 for ES, and for delineations by O2 the Dice scores were 0.91 ± 0.02 for ED and 0.88 ± 0.03 for ES. The mean (± SD) volumetric Hausdorff distances between the pipeline and delineations by O1 were 8.47 ± 1.32 for ED and 7.71 ± 0.96 for ES, and for delineations by O2 the Hausdorff distances were 8.11 ± 1.36 for ED and 7.30 ± 1.22 for ES. Between O1 and O2, the Dice scores were 0.91 ± 0.03 for ED and 0.86 ± 0.04 for ES, and the Hausdorff distances were 8.53 ± 1.29 for ED and 7.66 ± 1.26 for ES.
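For reference, the Dice score between two binary masks A and B is 2|A∩B| / (|A| + |B|). A minimal sketch is given below; it assumes that "volumetric" means the masks of all slices in a timeframe are compared as a single 3D array, and the function name `dice_score` is illustrative.

```python
import numpy as np


def dice_score(mask_a, mask_b):
    """Sørensen-Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / total
```

A score of 1 indicates identical masks and 0 indicates no overlap, so the reported values of 0.86 to 0.91 correspond to substantial but not complete overlap.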
For the ratings of the automated delineations on the CVS, the rating A (sufficient for clinical use) dominated, comprising 58% of all subjects. The remaining automated delineations were rated as B (needing minor corrections), and none were assessed to be in need of major corrections (C). For the test set (TS), 38% of delineations were rated as A, 50% as B, and 12% as C.
For the 21 delineations in CVS deemed in need of corrections, the volumetric inter-observer agreement was significantly improved (p = 2.23e−02 for ED and p = 5.00e−03 for ES, n = 21) (Fig. 5).
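The agreement measures outlined in the Statistics section can be sketched as below, assuming the conventional Bland-Altman definitions: bias as the mean difference and 95% limits of agreement as bias ± 1.96 SD of the differences. The function name `bland_altman` is hypothetical.

```python
import numpy as np


def bland_altman(measured, reference):
    """Return bias, SD of differences, 95% limits of agreement, relative bias (%)."""
    measured = np.asarray(measured, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff = measured - reference
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    # Relative bias as a percentage of the mean reference volume.
    relative_bias = 100.0 * bias / reference.mean()
    return bias, sd, limits, relative_bias
```

The SD of the differences is the variability whose change between manual and corrected delineations is of interest here; a smaller SD gives narrower limits of agreement.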
For O1, the average manual delineation time was 6 min and 46 s, and the average correction time was 1 min and 38 s (p = 5.96e−05, n = 21). For O2, the corresponding times were 5 min and 19 s and 2 min and 30 s, respectively (p = 3.86e−04, n = 18 due to missing data). The mean runtime for the pipeline per subject (in ED and ES) was 2 s.
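The reported averages can be converted to seconds to compare correction time with fully manual delineation time for each observer. The sketch below only restates the figures above; averaging the two observers with equal weight is an assumption, since their correction times are based on different numbers of cases (n = 21 and n = 18).

```python
def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) duration to seconds."""
    return 60 * minutes + seconds


# Average times per observer, as reported in the text.
manual = {"O1": to_seconds(6, 46), "O2": to_seconds(5, 19)}
corrected = {"O1": to_seconds(1, 38), "O2": to_seconds(2, 30)}

# Correction time as a fraction of manual delineation time.
ratio = {obs: corrected[obs] / manual[obs] for obs in manual}  # O1 ≈ 0.24, O2 ≈ 0.47

# Equal-weight average across observers (an assumption): ≈ 0.34
mean_ratio = (sum(corrected.values()) / 2) / (sum(manual.values()) / 2)
```

Under these assumptions, a minor correction took roughly a quarter of the manual delineation time for O1 and just under half for O2.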

Discussion
We have developed a CNN-based pipeline for RV segmentation and shown its ability to reduce the manual delineation time for obtaining delineations of clinically useful quality according to expert reader ratings. However, the pipeline's reduced performance when externally validated indicates that time reduction may not be fulfilled for clinics using other RV delineation guidelines. More than half of the automated delineations on the CVS were rated as of clinically sufficient quality without the need for manual corrections. These delineations were obtained in a matter of seconds using a powerful GPU in a laptop. The remaining delineations were rated as only in need of minor corrections, and the correction time for these was about a third of the manual delineation time. For the TS, 88% of delineations were clinically adequate or in need of minor corrections.
Even if both observers were of CMR level 3, an average of 87% of the average manual delineation time could be saved by using the pipeline. Less experienced observers may require longer manual delineation times, and a larger reduction in delineation time by use of the pipeline could therefore be expected. For example, Caudron et al. (2011) reported mean RV delineation times (for both endocardial and epicardial borders on 60 patients) of 13.4 and 18.9 min for observers with 3 and 1 years of training, respectively 29 . A result of a reduction in delineation time could be that an increased number of patients could be assessed in a day.
We have shown that the use of corrected pipeline delineations was able to decrease the inter-observer variability compared to manual delineations. In a study by Bai et al. (2018), the variability of a CNN-based cardiac segmentation method was on par with the inter-observer variability 15 . Our pipeline-observer variability was lower than the inter-observer variability, although only significantly for O1 in ES (p < 0.05). A reason for the substantial ES variability between observers could be that trabeculations may be harder to distinguish from the endocardial wall when the ventricle is contracted. Thus, even if human observers are trained in the same institution and adhere to the same consensus guidelines, they can still vary, especially when delineating the RV in ES. This shows the difficulty of the segmentation problem, and indicates that a common starting point could be helpful for decreasing the inter-observer variability (as seen in Fig. 5). Nonetheless, this also shows the difficulty of validating the performance of automated segmentation methods. Since the main dataset consisted of delineations from several observers, it is likely that delineation variabilities of a magnitude similar to the one between observers in ES were present. This could have affected the learning process negatively, by the ground truth examples showing contradictory ways of delineating the RV. However, it could also have affected the learning process positively, by allowing the segmentation network to learn a middle ground between the separate observers' individual opinions on RV delineations. The pipeline-observer bias was lower than between observers for EF and for volumes in ES, and of similar magnitude for volumes in ED, with no observable negative or positive trend.
Larger RVs did not seem to affect the delineation performance of the pipeline in a negative way (Fig. 4), showing the potential to use it in a clinical environment where both healthy cases and pathological cases (with enlarged RVs) may be present. This is important for clinical settings with both pediatric and adult cases. With a decreased inter-observer variability for RV delineations, the possibility to detect true volumetric changes between examinations of a patient at different time points would increase, making CMR an even more robust method for determining changes and trends in pathological states. Moreover, it is important that RV delineations yield replicable volumes close to their true absolute magnitude, since large variations in delineations could mislabel an RV as healthy or pathological 30 purely due to volumetric errors. The low bias indicates that this might not be a problem for the pipeline. The scan-rescan assessment indicated that inter-scan differences could affect the robustness of the pipeline in some cases. However, no manual reference segmentations were available to determine if the effect would be similar for human observers.
The segmentation network architecture, the 2D U-Net 31 (see Supplementary Methods S2 for details), is widely used in recent medical image segmentation literature 32 , and its functionality for cardiac segmentation is well described 16 . The novelty of our approach consists of extending this architecture with two additional CNN architectures for robustly pre-processing the CMR images before segmentation, with the purpose of handling the wide variability of CMR data that can be encountered in a clinical setting.
Although the used combination of CNN architectures is new, there exist many previous studies that have presented modifications or additions to segmentation network architectures similar to the one we have used [5][6][7][8][9]33 . However, unlike this study, most previous studies do not have clinical implementation as the end goal of their method development. Instead, a frequent goal is to create a method that improves upon the segmentation performance of existing deep learning methods. Performance improvements are commonly assessed using public datasets for training and testing, and evaluations are often performed in a challenge setting. As of November 2022, the 15 RV delineation methods on the Automated Cardiac Diagnosis Challenge post-2017-MICCAI-challenge testing phase leaderboard 23  The pipeline's performance on the same dataset was within these ranges, except for ED mean Hausdorff. However, our pipeline was trained using in-house data while the methods on the leaderboard had used the designated ACDC training dataset from the same domain as the testing dataset 23 . This disallows direct comparisons between the learning methodologies. Nonetheless, these results indicate that the trained pipeline can yield delineations that somewhat generalize to segmentation guidelines that differ from those used in its training dataset. Yet, it might not generalize well enough to yield a time reduction as substantial as that observed on the in-house data.
To our knowledge, there is no previous RV segmentation study that has presented the full process from clinical pipeline development to a clinical validation that focuses on the important aspect of time reduction. Existing clinical validation studies 16,17 base their analyses on commercial deep learning-based methods and provide limited details on development and function. The pipeline was implemented in the clinical software Segment CMR and made freely available to the research community in the software Segment 18 . We consider this study to be a step forward for the clinical use of deep learning-based RV segmentation methods, by motivating clinicians that still perform fully manual RV segmentations to start using these methods, and by showing clinicians that already use these methods that they can in fact be beneficial, despite the need for corrections.
Limitations. The delineation quality ratings for defining the quality level of a clinically useful delineation were done by a single observer (O1). Since they are subjective, they may vary between observers, making the presented ratings apply only to the observer that carried them out. However, before the ratings were carried out, the two observers together assessed delineations on 10 separate subjects to ensure a uniform assessment. They also adhered to the same consensus guidelines 22 . Thus, even though the limits of agreement between observers were somewhat wide (Fig. 4), they still have the same general understanding of how the RV should be delineated. It is thus likely that their quality ratings would be similar, and also resemble those of other observers that use the same guidelines. For centers using different guidelines such as excluding trabeculations and papillary muscles from the volumes, the time reduction by using the pipeline was not assessed.
Due to the removal of clinical information during anonymization, it was not possible to assess whether the manual corrections could yield changes in pathological classification. Also, no pathology-specific performance testing could be done. Therefore, no assessment could be made regarding the effect of pathologies on pipeline performance. Nonetheless, our training set included a wide range of volumes and subjects referred to cardiac MR at a large university hospital, making it likely that the pipeline is applicable to pathological cases. Furthermore, the pipeline used short-axis images, and did not make use of transversal (axial) images, which could be a limitation for some institutions. Also, a more robust segmentation model could potentially have been obtained through experimenting more with different types of data augmentation. We abstained from performing any additional training of our pipeline on any public datasets (e.g. the ACDC training set) because: (a) we used a segmentation network with well-described cardiac segmentation abilities; (b) we believe that the strength of our method can largely be attributed to the width of the training dataset that was used; and (c) the purpose of this study was to validate the clinical usefulness of our pipeline, and since retraining the networks of our pipeline using a different dataset would change the properties of our pipeline, we consider it to be outside the scope of this study.
This study only assessed a reduction in segmentation time compared to fully manual segmentations and did not consider other segmentation methods used in contemporary clinical practice. Moreover, this study showed the possibility to accelerate a single segmentation task. However, these results are not sufficient to determine if the reduced cost from the accelerated segmentation outweighs the potential cost of commercialized deep learning-based segmentation software.

Conclusion
A deep learning-based clinical pipeline could substantially reduce the time needed to manually obtain RV delineations of clinically sufficient quality, even when considering the occasional need for manual corrections. This indicates that deep learning-based methods can yield right ventricular assessments faster than fully manual analysis in clinical practice, but it remains to be assessed how the observed time reduction would change for clinics adhering to other delineation guidelines.

Data availability
Data supporting the findings of this study are available from Skåne University Hospital, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Aggregated data are however available from the authors upon reasonable request and with permission of Skåne University Hospital. The pipeline is implemented in the software Segment (http://segment.heiberg.se) and is freely available for research purposes.